Quality of language models for distributed information retrieval
Author
Abstract
Collections used in distributed information retrieval (DIR) are often described by unigram language models, composed of simple term-probability statistics. In most cases, this information is not directly available from the constituent collections and must be estimated by the DIR tool itself from a sample of documents. The factors affecting the quality of such estimates are not well understood, nor is the impact of estimate quality. Several measures of quality for unigram language models have been described, and three are used here to investigate how the quality of a model changes given document samples of differing size or quality. I show that although all models improve given larger samples, those built from more biased samples are of significantly lower quality, and that one of the three measures, Kullback-Leibler divergence, best describes model quality. Finally, I show that model quality affects the effectiveness of standard server selection algorithms.
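As a rough illustration of the idea in the abstract, the sketch below estimates a unigram language model from a document sample and scores it against the model of the full collection using Kullback-Leibler divergence. It is not the paper's procedure: the toy collection, the function names `unigram_model` and `kl_divergence`, the additive smoothing constant `alpha`, and the choice to measure divergence from the full-collection model to the sampled one are all assumptions made for this example.

```python
import math
from collections import Counter


def unigram_model(docs, vocab, alpha=0.01):
    """Estimate term probabilities over `vocab` from tokenised documents.

    Additive smoothing (alpha) is an assumption of this sketch; it keeps
    unseen terms at a non-zero probability so the divergence stays finite.
    """
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats for two models over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


# Hypothetical toy collection: each document is a list of terms.
collection = [
    ["distributed", "retrieval", "server"],
    ["language", "model", "retrieval"],
    ["server", "selection", "model"],
    ["query", "sample", "retrieval"],
]
vocab = sorted({term for doc in collection for term in doc})

actual = unigram_model(collection, vocab)      # model of the whole collection
small = unigram_model(collection[:1], vocab)   # model from a 1-document sample
large = unigram_model(collection[:3], vocab)   # model from a 3-document sample

kl_small = kl_divergence(actual, small)
kl_large = kl_divergence(actual, large)
print(f"KL with 1-document sample: {kl_small:.3f}")
print(f"KL with 3-document sample: {kl_large:.3f}")
```

A lower divergence means the sampled model is closer to the full-collection model. In this toy data the larger sample scores lower, which matches the abstract's observation that models improve with larger samples, although a four-document example says nothing about sampling bias.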
Related resources
Improved Skips for Faster Postings List Intersection
Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...
Investigating the Effects of Stemming on Information Retrieval in the Persian Language
Exploiting language-specific behavior in information retrieval systems can significantly improve the quality of the retrieved results. The part of a word that remains after its affixes are removed is called the stem. Stemming can be used to improve the relevance of results in an information retrieval system. Different morphological variants of words (plural, past tense…) will be mapped into t...
Applying the Expertise Retrieval Model to Find Expert Authors
This research applied the Expertise Retrieval model to find expert authors, and used evaluation methods from Information Retrieval systems to measure the performance of those models. The research is experimental. In addition, a variety of methods, including the survey method, were used in the research process. Various models were developed for finding expert authors, all built on a known ...
A Distributed Intelligent Agent Approach to Context in Information Retrieval
Information retrieval across disadvantaged networks requires intelligent agents that can make decisions about what to transmit in such a way as to minimize network performance impact while maximizing utility and quality of information (QOI). Specialized agents at the source need to process unstructured, ad-hoc queries, identifying both the context and the intent to determine the implied task. K...